I’m sure most of you, if not all, know what Airbnb is, but some of you might not know how it started. Airbnb began in 2008 when two designers who had space to share hosted three travellers looking for a place to stay. Now, millions of hosts and travellers choose to create a free Airbnb account so they can list their space and book unique accommodations anywhere in the world, and Airbnb experience hosts share their passions and interests with both travellers and locals (Airbnb website).
Let me tell you about my own experience. Back in 2015, while I was in my second year of university in Toronto, I rented an apartment for a full year. However, circumstances changed and I had to move out ten months in. Anyone who has lived downtown knows that it is not easy to rent out a place during the summer holidays, especially for only two months. So I decided to list my apartment on Airbnb, and I was able to cover my rent for the two months. In fact, I even made a little more.
The one part I struggled with was the pricing. Although Airbnb uses Smart Pricing now, it did not have a pricing algorithm at that time, so I’ll assume that it still does not (for modelling purposes only). I tried looking at listings around downtown Toronto, but that didn’t help much, because I had to average the price of listings in the area with features similar to mine, such as the type of property (apartment, house, condominium, etc.), room type (shared, private, etc.), number of beds, number of bathrooms, number of guests accommodated, amenities and many more that I did not even think of. Because of this, and out of curiosity, I decided to pick an Airbnb listings dataset (Kaggle) and extract information on the key factors in order to build an algorithm that predicts the price of future Airbnb listings based on those factors.
Later in this report, I will discuss how this helps the company generate revenue. My report relies on three components: 1) Data Wrangling (or cleaning), 2) Exploratory Data Analysis (using plots and assumption tests) and 3) Predictive Modelling (using training and testing).
Let’s explore some data pre-processing techniques and apply them to our dataset. The dataset contains some variables that are not relevant to the analysis, so we will exclude them. These include ‘id’, ‘Description’, ‘Date of First Review’, ‘Last Review Date’, ‘Listing url’ and ‘Zip Code’. This leaves us with 18 predictors, some of them continuous and most of them categorical. I made sure the dataset is clean and organized and contains no missing values. I have also filtered the values of some variables, keeping those with enough occurrences for our sample size. In addition, I have added some new variables by extracting information from the text fields. Here is what our new dataset looks like:
## Log_Price Property_Type Room_Type Accommodates Bathrooms Bed_Type
## 1 5.010635 Apartment Entire home/apt 3 1 Real Bed
## 2 5.129899 Apartment Entire home/apt 7 1 Real Bed
## 3 4.976734 Apartment Entire home/apt 5 1 Real Bed
## 5 4.744932 Apartment Entire home/apt 2 1 Real Bed
## 6 4.442651 Apartment Private room 2 1 Real Bed
## 7 4.418841 Apartment Entire home/apt 3 1 Real Bed
## Cancellation_Policy City Number_of_Reviews Review_Scores Bedrooms Beds
## 1 strict NYC 2 100 1 1
## 2 strict NYC 6 93 3 3
## 3 moderate NYC 10 92 1 3
## 5 moderate DC 4 40 0 1
## 6 strict SF 3 100 1 1
## 7 moderate LA 15 97 1 1
## TV Internet Parking Kitchen Cleaning_Fee Profile_Pic Instant_Bookable
## 1 FALSE TRUE FALSE TRUE TRUE TRUE FALSE
## 2 FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
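The cleaning steps described above can be sketched in base R roughly as follows. This is only an illustration on a made-up four-row data frame: the column names (`id`, `description`, `amenities`, `log_price`) and the idea that the amenity flags come from a single free-text column are assumptions, not the actual Kaggle schema or the code used for this report.

```r
# Hypothetical raw listings; column names are illustrative only.
raw_df <- data.frame(
  id = 1:4,
  log_price = c(5.01, 5.13, NA, 4.44),
  description = c("Cozy flat", "Big loft", "Studio", "Room"),
  amenities = c("{Internet,Kitchen}", "{TV,Internet,Kitchen}",
                "{TV}", "{TV,Internet}"),
  stringsAsFactors = FALSE
)

# 1) Extract logical flags from the free-text amenities column.
raw_df$TV       <- grepl("TV", raw_df$amenities, fixed = TRUE)
raw_df$Internet <- grepl("Internet", raw_df$amenities, fixed = TRUE)
raw_df$Kitchen  <- grepl("Kitchen", raw_df$amenities, fixed = TRUE)

# 2) Drop identifier and free-text columns that are not used as predictors.
drop_cols <- c("id", "description", "amenities")
clean_df <- raw_df[, !(names(raw_df) %in% drop_cols)]

# 3) Keep only rows without missing values.
clean_df <- clean_df[complete.cases(clean_df), ]

print(clean_df)
```

Note that a plain substring match such as `"TV"` would also match entries like "Cable TV"; whether that is desirable depends on how the real amenities text is encoded.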
Now that we have the modified dataset ‘new_df’, we will use it for further analysis. The main purpose of this analysis is to find the relationship between ‘Log_Price’ and the factors that influence the price. However, it is important that we first understand the trends within the data, the distribution of each variable and, for categorical variables, how common each level is. To accomplish this, we create plots for each of these variables, starting with the dependent variable, ‘Log_Price’.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.304 4.700 4.749 5.165 7.600
The median Log_Price is 4.70, the data seem evenly distributed, and the histogram and the boxplot are approximately symmetric. There are not many outliers. The mean, median and mode are close to each other. Therefore, we can say that Log_Price is approximately normally distributed.
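One quick numeric companion to the histogram and boxplot is to compare the mean with the median and compute the sample skewness. The sketch below uses a small made-up vector rather than the report's data, and the `skewness` helper is written here for illustration (it is not a base R function).

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# Values near 0 indicate a roughly symmetric distribution.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up log prices for illustration (not the report's data).
log_price <- c(3.9, 4.3, 4.5, 4.7, 4.7, 4.9, 5.1, 5.5)

summary(log_price)       # compare mean and median
skewness(log_price)      # close to 0 for symmetric data
gap <- abs(mean(log_price) - median(log_price))
gap                      # small gap is consistent with symmetry
```

A skewness near zero and a small mean-median gap are consistent with symmetry, but they do not by themselves establish normality.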
Now we will analyze the relation between Log Price and the relevant predictors.
Here’s a correlation plot from our dataset for all numeric variables even though some of them may be considered categorical such as the variable ‘Bedrooms’ which has 11 distinct levels. The correlation plot gives us an idea about the impact each of these variables might have over price. However, we can’t be too certain just by looking at this plot. So, we will use the data and perform tests that will tell us whether there is evidence of any connection between the predictors and Log_Price.
For the purpose of analysis, I have removed the property types that had 10 or fewer listings because the sample size is too small to conclude anything about those properties.
While looking at the plot we find that the most common types of listings are apartments, houses and condominiums. Now, we analyze the relationship between the property types and the price.
By looking at the chart, we can comfortably say that prices for luxury rentals such as boats, condominiums or vacation homes are higher than those of regular apartments, cabins or hostels. We will now perform further analysis on the impact of these property types on the final price of the listing.
##
## Call:
## lm(formula = Log_Price ~ Property_Type, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9415 -0.4586 -0.0390 0.4253 2.8609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.739497 0.003423 1384.489 < 2e-16 ***
## Property_TypeBed & Breakfast -0.226162 0.035449 -6.380 1.79e-10 ***
## Property_TypeBoat 0.373526 0.097802 3.819 0.000134 ***
## Property_TypeBoutique hotel 0.088967 0.096757 0.919 0.357843
## Property_TypeBungalow 0.028586 0.037989 0.752 0.451774
## Property_TypeCabin -0.120550 0.083590 -1.442 0.149262
## Property_TypeCamper/RV -0.323407 0.081671 -3.960 7.51e-05 ***
## Property_TypeCastle 0.628927 0.183892 3.420 0.000626 ***
## Property_TypeCondominium 0.201954 0.015246 13.246 < 2e-16 ***
## Property_TypeDorm -1.060956 0.064784 -16.377 < 2e-16 ***
## Property_TypeGuest suite -0.026812 0.066713 -0.402 0.687756
## Property_TypeGuesthouse -0.064291 0.032957 -1.951 0.051095 .
## Property_TypeHostel -1.180551 0.089453 -13.197 < 2e-16 ***
## Property_TypeHouse 0.002562 0.006752 0.379 0.704364
## Property_TypeIn-law 0.119811 0.083590 1.433 0.151772
## Property_TypeLoft 0.243639 0.021148 11.520 < 2e-16 ***
## Property_TypeOther 0.024867 0.032604 0.763 0.445651
## Property_TypeServiced apartment 0.255729 0.171198 1.494 0.135245
## Property_TypeTent -0.742379 0.177205 -4.189 2.80e-05 ***
## Property_TypeTimeshare 0.830698 0.115450 7.195 6.31e-13 ***
## Property_TypeTownhouse 0.036547 0.018758 1.948 0.051377 .
## Property_TypeVacation home 0.538618 0.270656 1.990 0.046591 *
## Property_TypeVilla 0.228331 0.059392 3.844 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6629 on 56983 degrees of freedom
## Multiple R-squared: 0.01625, Adjusted R-squared: 0.01587
## F-statistic: 42.79 on 22 and 56983 DF, p-value: < 2.2e-16
After analysing how the various property types affect our dependent variable, Log_Price, we find that some property types influence the price much more than others. Each coefficient compares a property type with the reference level, ‘Apartment’. For example, the p-values for ‘Bed & Breakfast’ and ‘Boat’ are very low, whereas the p-value for ‘Boutique hotel’ (0.36) is not significant. The overall F-test (p-value < 2.2e-16) gives us sufficient evidence to reject the null hypothesis that there is no difference in price between the property types. Therefore, the type of property does affect the price, although on its own it explains only about 1.6% of the variance in Log_Price (R-squared = 0.016).
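To make the reading of such a regression table concrete, here is a minimal sketch on made-up data with three property types. It illustrates that, with a single categorical predictor, the intercept is the mean of the reference level and every other coefficient is a difference in means from that level. The numbers are invented for illustration.

```r
# Toy data with three hypothetical property types (values are made up).
toy <- data.frame(
  Log_Price = c(4.6, 4.8, 4.7, 5.0, 5.2, 5.1, 3.6, 3.8, 3.7),
  Property_Type = factor(rep(c("Apartment", "Condominium", "Hostel"), each = 3))
)

fit <- lm(Log_Price ~ Property_Type, data = toy)

# The intercept is the mean of the reference level (first level alphabetically).
group_means <- tapply(toy$Log_Price, toy$Property_Type, mean)
coef(fit)["(Intercept)"]           # equals group_means["Apartment"]

# Each remaining coefficient is that level's mean minus the reference mean.
coef(fit)["Property_TypeHostel"]   # equals Hostel mean - Apartment mean

# Per-level p-values test "this level vs. the reference level";
# the F-test asks whether property type matters at all.
anova(fit)
```

This is why a non-significant row in the table only says that the level is not distinguishable from the reference level, not that the predictor as a whole is irrelevant.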
Compared to entire homes/apartments and private rooms, very few listings on Airbnb are shared rooms.
In terms of price it seems like the price range for an entire house/apartment is higher than that of a private room and the average price of a shared room is the lowest. This seems like a reasonable assumption. However, to be certain about the connection between the type of rooms and price, we will look at the p-values.
##
## Call:
## lm(formula = Log_Price ~ Room_Type, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7679 -0.3580 -0.0395 0.2969 3.0252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.111566 0.002826 1809.0 <2e-16 ***
## Room_TypePrivate room -0.823525 0.004413 -186.6 <2e-16 ***
## Room_TypeShared room -1.343619 0.013809 -97.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5117 on 57003 degrees of freedom
## Multiple R-squared: 0.4137, Adjusted R-squared: 0.4137
## F-statistic: 2.011e+04 on 2 and 57003 DF, p-value: < 2.2e-16
The p-values for the coefficients suggest that all room types affect the price differently, with an entire home/apartment having a higher price than a private room or a shared room.
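Because the response is a log price, the coefficients can be translated into approximate percentage differences. The sketch below assumes that Log_Price is a natural logarithm, which the report does not state explicitly. The coefficients are copied from the Room_Type output above.

```r
# Coefficients copied from the Room_Type regression output above.
room_coefs <- c("Private room" = -0.823525, "Shared room" = -1.343619)

# ASSUMPTION: Log_Price is a natural log. Then a coefficient b corresponds to
# a multiplicative factor exp(b) on price, i.e. a change of (exp(b) - 1) * 100 %.
pct_change <- (exp(room_coefs) - 1) * 100
round(pct_change, 1)
```

Under that assumption, private rooms would be priced at roughly 44% of an entire home/apartment and shared rooms at roughly 26%, on a geometric-mean basis.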
By looking at the graph, we can see that most of the listings in our dataset accommodate between 1 and 6 people. Only a very small number of properties accommodate 7 or more people.
Let’s now look at 2 different plots displaying the relationship between the predictor ‘Accommodates’ and the response variable ‘Log_Price’.
We can see an increasing trend in the graphs of ‘Accommodates’ vs ‘Log_Price’. This means listings that accommodate more people have a higher price in general. Again, that seems like a reasonable assumption.
##
## Call:
## lm(formula = Log_Price ~ Accommodate, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1153 -0.3424 -0.0170 0.3274 3.1980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.115256 0.006781 606.84 <2e-16 ***
## Accommodate2 0.401571 0.007558 53.13 <2e-16 ***
## Accommodate3 0.643514 0.009492 67.79 <2e-16 ***
## Accommodate4 0.882905 0.008641 102.18 <2e-16 ***
## Accommodate5 1.067318 0.012119 88.07 <2e-16 ***
## Accommodate6 1.254916 0.010760 116.63 <2e-16 ***
## Accommodate7 1.372696 0.019962 68.76 <2e-16 ***
## Accommodate8 1.536085 0.015429 99.56 <2e-16 ***
## Accommodate9 1.523196 0.036385 41.86 <2e-16 ***
## Accommodate10 1.683195 0.023214 72.51 <2e-16 ***
## Accommodate11 1.684784 0.064071 26.30 <2e-16 ***
## Accommodate12 1.907182 0.037315 51.11 <2e-16 ***
## Accommodate13 1.734125 0.097795 17.73 <2e-16 ***
## Accommodate14 1.834271 0.055793 32.88 <2e-16 ***
## Accommodate15 1.794683 0.084400 21.26 <2e-16 ***
## Accommodate16 1.877427 0.036631 51.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5254 on 56990 degrees of freedom
## Multiple R-squared: 0.382, Adjusted R-squared: 0.3819
## F-statistic: 2349 on 15 and 56990 DF, p-value: < 2.2e-16
We can see that the coefficients for the levels of ‘Accommodates’ differ noticeably from each other and follow a broadly increasing trend, which supports the earlier assumption that listings that accommodate more people generally have a higher price.
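The trend can be checked directly from the table. The following sketch copies the coefficient estimates from the output above and lists the capacities at which the estimate drops relative to the previous level.

```r
# Coefficient estimates copied from the regression output above
# (levels 2 to 16; level 1 is the reference with coefficient 0).
accommodates <- 2:16
coefs <- c(0.401571, 0.643514, 0.882905, 1.067318, 1.254916, 1.372696,
           1.536085, 1.523196, 1.683195, 1.684784, 1.907182, 1.734125,
           1.834271, 1.794683, 1.877427)

# Change in the estimate from each level to the next.
steps <- diff(coefs)

# Levels whose estimate is lower than that of the level just below them.
drops <- accommodates[-1][steps < 0]
drops
```

The estimates rise steadily up to 8 guests; the small reversals occur only at higher capacities, where the standard errors in the table are larger.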
Looking at this graph, we can say that most properties offer either one or two bathrooms. We will now take a closer look at the plot of Bathrooms vs Log_Price to see if the number of bathrooms has an effect on the price of the property.
We can see a similar trend to what we saw in the plot for ‘Accommodates’. More bathrooms generally means a larger house/apartment and therefore the average price is higher for properties with more bathrooms.
##
## Call:
## lm(formula = Log_Price ~ Bathroom, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6525 -0.4040 0.0015 0.4061 3.4708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.13988 0.05127 80.748 <2e-16 ***
## Bathroom0.5 0.05084 0.07315 0.695 0.487
## Bathroom1 0.51262 0.05135 9.982 <2e-16 ***
## Bathroom1.5 0.57360 0.05254 10.918 <2e-16 ***
## Bathroom2 1.01084 0.05188 19.483 <2e-16 ***
## Bathroom2.5 1.28877 0.05447 23.660 <2e-16 ***
## Bathroom3 1.31785 0.05571 23.656 <2e-16 ***
## Bathroom3.5 1.87155 0.06283 29.788 <2e-16 ***
## Bathroom4 1.35563 0.06643 20.406 <2e-16 ***
## Bathroom4.5 2.31647 0.08762 26.436 <2e-16 ***
## Bathroom5 2.03705 0.10076 20.216 <2e-16 ***
## Bathroom5.5 2.76293 0.13187 20.952 <2e-16 ***
## Bathroom6 2.30178 0.17332 13.280 <2e-16 ***
## Bathroom6.5 2.57120 0.22494 11.430 <2e-16 ***
## Bathroom7 2.42408 0.28175 8.604 <2e-16 ***
## Bathroom7.5 3.30939 0.31396 10.541 <2e-16 ***
## Bathroom8 -0.03508 0.13409 -0.262 0.794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6195 on 56989 degrees of freedom
## Multiple R-squared: 0.1408, Adjusted R-squared: 0.1406
## F-statistic: 583.8 on 16 and 56989 DF, p-value: < 2.2e-16
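The p-values are significant for almost all levels of the factor variable ‘Bathrooms’. The exceptions are 0.5 bathrooms (p = 0.487) and 8 bathrooms (p = 0.794), which are not significantly different from the reference level of 0 bathrooms. The differences between the coefficients across levels tell us that the price changes with the number of bathrooms.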
## Airbed Couch Futon Pull-out Sofa Real Bed
## 340 158 599 491 55418
We can see that the most common type of bed amongst all properties is the ‘Real Bed’. However, the plot does not tell us much about its relationship with the price, as all other bed types seem to result in average listing prices that are very close to each other.
##
## Call:
## lm(formula = Log_Price ~ Bed_Type, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7614 -0.4439 -0.0454 0.4148 3.0503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.22900 0.03600 117.462 < 2e-16 ***
## Bed_TypeCouch -0.07111 0.06392 -1.112 0.266
## Bed_TypeFuton 0.05259 0.04508 1.167 0.243
## Bed_TypePull-out Sofa 0.20876 0.04684 4.457 8.33e-06 ***
## Bed_TypeReal Bed 0.53235 0.03611 14.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6639 on 57001 degrees of freedom
## Multiple R-squared: 0.01312, Adjusted R-squared: 0.01305
## F-statistic: 189.5 on 4 and 57001 DF, p-value: < 2.2e-16
The p-values and the coefficients tell us that, relative to the reference level (an airbed), a real bed or a pull-out sofa is associated with a higher price. Properties with an airbed, couch or futon have a lower average price, and there is no significant difference between the average prices of those three bed types.
For the purpose of analysis, we will only look at the three prominent policies, which are ‘Flexible’, ‘Moderate’ and ‘Strict’.
It’s hard to tell whether having a different policy makes a difference in the price. Presumably, a property with a flexible policy might be pricier, in the sense that clients might pay a little more to avoid the risk of heavy cancellation charges. On the other hand, a property with a flexible cancellation policy may also cost less to rent, as the landlord of a less popular (low-priced) property might offer a flexible policy to attract more clients.
##
## Call:
## lm(formula = Log_Price ~ Cancellation_Policy, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8647 -0.4560 -0.0364 0.4254 2.9500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.556558 0.005775 788.95 <2e-16 ***
## Cancellation_Policymoderate 0.147900 0.007721 19.16 <2e-16 ***
## Cancellation_Policystrict 0.308130 0.006996 44.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6566 on 57003 degrees of freedom
## Multiple R-squared: 0.03463, Adjusted R-squared: 0.03459
## F-statistic: 1022 on 2 and 57003 DF, p-value: < 2.2e-16
The three policies have different mean prices and the p-values are significant, which means that each policy affects the price differently: relative to ‘Flexible’, the reference level, listings with ‘Moderate’ and ‘Strict’ policies have higher average prices.
So, about 75% of the properties have a cleaning fee associated with them. We will now see if having a cleaning fee affects the property price.
##
## Two Sample t-test
##
## data: Log_Price by Cleaning_Fee
## t = -44.001, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3132919 -0.2865715
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.510270 4.810202
A significant p-value and a larger mean price for listings with a cleaning fee suggest that having a cleaning fee is associated with a higher property price.
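The header “Two Sample t-test” together with df = 57004 (two fewer than the 57,006 listings) suggests a pooled-variance test. The sketch below shows what such a call might look like in R on made-up data and reproduces the statistic by hand. The data are invented, and that the report used `var.equal = TRUE` is an inference from the output, not something the report states.

```r
# Made-up log prices for listings without and with a cleaning fee.
toy <- data.frame(
  Log_Price = c(4.2, 4.5, 4.4, 4.6, 4.3, 4.8, 5.0, 4.7, 4.9, 5.1),
  Cleaning_Fee = rep(c(FALSE, TRUE), each = 5)
)

# var.equal = TRUE gives the classic pooled-variance "Two Sample t-test"
# (the default, var.equal = FALSE, would run Welch's test instead).
res <- t.test(Log_Price ~ Cleaning_Fee, data = toy, var.equal = TRUE)

# Same statistic by hand: difference in means over the pooled standard error.
x <- toy$Log_Price[!toy$Cleaning_Fee]
y <- toy$Log_Price[toy$Cleaning_Fee]
sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
  (length(x) + length(y) - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / length(x) + 1 / length(y)))

res$statistic   # matches t_manual
res$parameter   # degrees of freedom: n1 + n2 - 2
```

The sign of the statistic follows the group order (FALSE minus TRUE), which is why a higher mean in the TRUE group shows up as a negative t value.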
Let’s now look at the different cities and see how each city is doing in terms of business for Airbnb. Here’s a plot comparing the total number of property listings from each city.
Amongst the 6 cities shown above, New York and LA have the most listings. This could be because of the city size, population, attractions or other factors.
I will now examine the price comparison between each city and find out whether the predictor ‘City’ has an effect on the dependent variable or not.
## # A tibble: 6 x 3
## City avg_price Count
## <fct> <dbl> <int>
## 1 Boston 4.86 2779
## 2 Chicago 4.59 3198
## 3 DC 4.79 4052
## 4 LA 4.70 17083
## 5 NYC 4.71 24878
## 6 SF 5.12 5016
Grouping our data by city, we find that the average price differs across cities, except for LA and New York, which have almost the same average price.
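A per-city summary like the one above can be produced in several ways. The tibble header suggests a tidyverse pipeline, but the following base R sketch on made-up data yields the same kind of table. The toy values are invented for illustration.

```r
# Made-up listings for three cities.
toy <- data.frame(
  City = c("NYC", "NYC", "LA", "LA", "SF", "SF", "SF"),
  Log_Price = c(4.6, 4.8, 4.5, 4.9, 5.0, 5.2, 5.1)
)

# Mean log price and number of listings per city.
avg_price <- tapply(toy$Log_Price, toy$City, mean)
count <- table(toy$City)

city_summary <- data.frame(
  City = names(avg_price),
  avg_price = as.numeric(avg_price),
  Count = as.integer(count[names(avg_price)])
)
city_summary
```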
##
## Call:
## lm(formula = Log_Price ~ City, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7098 -0.4614 -0.0559 0.4062 2.8983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.85583 0.01246 389.812 < 2e-16 ***
## CityChicago -0.26913 0.01703 -15.803 < 2e-16 ***
## CityDC -0.06339 0.01617 -3.919 8.9e-05 ***
## CityLA -0.15574 0.01343 -11.594 < 2e-16 ***
## CityNYC -0.14599 0.01313 -11.115 < 2e-16 ***
## CitySF 0.25940 0.01553 16.704 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6567 on 57000 degrees of freedom
## Multiple R-squared: 0.03439, Adjusted R-squared: 0.0343
## F-statistic: 406 on 5 and 57000 DF, p-value: < 2.2e-16
Performing further analysis, we find out that each city affects the price differently. This is evident through the significant p-values and the coefficient estimates for each city. Estimates are different across cities except for LA and New York. These are both large cities with similar demographics and we can expect them to have a similar demand.
## Mode FALSE TRUE
## logical 233 56773
Almost all of the hosts have a profile picture. This may be a very important part of creating a profile or listing a property. Let’s look at how this affects the price.
##
## Two Sample t-test
##
## data: Log_Price by Profile_Pic
## t = 0.90882, df = 57004, p-value = 0.3635
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04611326 0.12584844
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.788380 4.748512
After applying the two-sample t-test, we find that the p-value is large (0.36), so we do not have enough evidence to reject the null hypothesis that the true difference in means is equal to 0. Also, the mean price hardly differs between hosts with and without a profile picture, and the group of hosts without a profile picture is too small to conclude whether the price depends on having one.
Most of the properties cannot be booked instantly. This may be due to several factors, including security or availability of space. We will now find out whether being able to book a property instantly affects its price.
##
## Two Sample t-test
##
## data: Log_Price by Instant_Bookable
## t = 12.011, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06338561 0.08810650
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.769023 4.693277
Although we have enough evidence to reject the null hypothesis, the difference between the two means is only about 1.6% on the log scale. Given this, it is hard to conclude whether the difference is caused by the property being instantly bookable or by some other factor.
The histogram shows us that more than half of the properties listed in our dataset have fewer than 20 reviews, and most properties have fewer than 40 reviews; only a few have 40 to 60 reviews or more. We will now find out if the price depends on the number of reviews or not.
##
## Pearson's product-moment correlation
##
## data: new_df$Log_Price and new_df$Number_of_Reviews
## t = -1.863, df = 57004, p-value = 0.06247
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0160106792 0.0004062919
## sample estimates:
## cor
## -0.007802719
Reviews can be good or bad, so it is hard to establish that the price depends on the number of reviews alone. The correlation estimate is -0.008, which is very close to 0, and the p-value (0.062) is not significant at the 5% level. There’s hardly any association between the two variables. Review scores, however, might be a better predictor of how reviews affect our final price.
Looking at the ratings, we notice that most of the properties have a rating of 80 or higher. This is a positive result for Airbnb as most clients are satisfied with the service they have received. Let’s find out if the price depends on these ratings or not.
##
## Pearson's product-moment correlation
##
## data: new_df$Log_Price and new_df$Review_Scores
## t = 21.999, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08360440 0.09988417
## sample estimates:
## cor
## 0.09175041
There is a positive but weak correlation of about 0.09 between the ratings and the price. In addition, the p-value is small enough to reject the null hypothesis. This suggests that as ‘Review_Scores’ increases, ‘Log_Price’ tends to increase as well.
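With a sample this large, even a tiny correlation can be statistically significant, so it helps to see how the test statistic is built from the correlation and the degrees of freedom. The sketch below recomputes the two t values from the estimates reported above; `t_from_r` is a helper defined here for illustration.

```r
# Correlations and degrees of freedom copied from the two outputs above.
r_reviews <- -0.007802719   # Log_Price vs Number_of_Reviews
r_scores  <-  0.09175041    # Log_Price vs Review_Scores
df <- 57004

# Test statistic for H0: correlation = 0.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t_from_r(r_reviews, df)   # about -1.863, as reported
t_from_r(r_scores, df)    # about 21.999, as reported

# Share of variance in Log_Price that review scores explain on their own.
r_scores^2
```

The squared correlation for review scores is below 0.01, so the relationship is statistically clear but explains under 1% of the variance in Log_Price by itself.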
More than 85% of our dataset consists of listings with 1, 2 or 3 bedrooms, and most of them have 1 bedroom. However, there are also listings that have 4, 5, 6 or 7 bedrooms. The number of listings with more than 7 bedrooms is very limited. Let’s find out if the price depends on the number of bedrooms or not.
Just by looking at the plot, we observe an increasing trend between the number of bedrooms and the price. This means that properties with more bedrooms are generally priced higher, which seems logical. We will now perform a regression analysis to look more closely at the relationship between the price and the number of bedrooms.
##
## Call:
## lm(formula = Log_Price ~ Bedroom, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5126 -0.3432 -0.0128 0.3516 3.0878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.785844 0.007433 643.898 < 2e-16 ***
## Bedroom1 -0.273261 0.007934 -34.443 < 2e-16 ***
## Bedroom2 0.442318 0.009376 47.174 < 2e-16 ***
## Bedroom3 0.797578 0.011995 66.492 < 2e-16 ***
## Bedroom4 1.160534 0.018840 61.598 < 2e-16 ***
## Bedroom5 1.403465 0.034709 40.435 < 2e-16 ***
## Bedroom6 1.623537 0.068615 23.662 < 2e-16 ***
## Bedroom7 1.772148 0.097523 18.171 < 2e-16 ***
## Bedroom8 1.491360 0.163410 9.126 < 2e-16 ***
## Bedroom9 1.567817 0.270806 5.789 7.10e-09 ***
## Bedroom10 1.281712 0.242239 5.291 1.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5414 on 56995 degrees of freedom
## Multiple R-squared: 0.3437, Adjusted R-squared: 0.3436
## F-statistic: 2985 on 10 and 56995 DF, p-value: < 2.2e-16
Between 1 and 7 bedrooms, the price increases continually. The p-values are very small and the differences between the coefficients clearly tell us that the price of a property varies with the number of bedrooms. More bedrooms generally means a higher price, except for a few anomalies. In those cases, other factors are more prominent.
We found out that the number of bedrooms has an increasing effect on the price. We’ll now look at the ‘Beds’ variable and see if it has a similar effect on the price. We assume it does because beds and bedrooms are complements: a listing with more bedrooms usually has more beds, and vice versa. Let’s see if that is the case.
Again, our observations for this variable are pretty similar to what we found earlier for the number of bedrooms. Most of the listings have 1, 2 or 3 beds. Having more beds would generally mean a higher price. Let’s find out if that is true.
We observe that the price increases as the number of beds increases, especially between 1 and 7 beds, which is where most of the data is concentrated. So we’re pretty confident about the positive correlation between price and the number of beds.
##
## Call:
## lm(formula = Log_Price ~ Bed, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4964 -0.3765 0.0034 0.3664 3.1040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6913 0.5725 8.195 2.57e-16 ***
## Bed1 -0.1949 0.5725 -0.340 0.7335
## Bed2 0.2390 0.5725 0.418 0.6763
## Bed3 0.5590 0.5725 0.976 0.3289
## Bed4 0.7367 0.5726 1.286 0.1983
## Bed5 0.9687 0.5728 1.691 0.0908 .
## Bed6 0.9521 0.5730 1.662 0.0966 .
## Bed7 1.1910 0.5741 2.075 0.0380 *
## Bed8 0.5451 0.5744 0.949 0.3426
## Bed9 1.1910 0.5772 2.064 0.0391 *
## Bed10 0.4275 0.5766 0.741 0.4585
## Bed11 1.2293 0.5866 2.095 0.0361 *
## Bed12 1.2160 0.5834 2.084 0.0371 *
## Bed13 1.0990 0.6035 1.821 0.0686 .
## Bed14 0.3180 0.7012 0.454 0.6502
## Bed15 0.8520 0.6401 1.331 0.1832
## Bed16 0.5872 0.5848 1.004 0.3153
## Bed18 1.9933 0.8096 2.462 0.0138 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5725 on 56988 degrees of freedom
## Multiple R-squared: 0.2663, Adjusted R-squared: 0.266
## F-statistic: 1216 on 17 and 56988 DF, p-value: < 2.2e-16
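The overall F-test is significant (p-value < 2.2e-16), so the price does vary with the number of beds, and the coefficient estimates differ from one level to the next. However, most of the individual p-values are not significant. Each coefficient is compared against the reference level of 0 beds, and the standard error of the intercept (0.5725) equals the residual standard error, which indicates that this level contains a single listing. Every comparison against it is therefore very imprecise.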
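When a factor's reference level has very few observations, every coefficient inherits that level's uncertainty. One common remedy is to re-base the factor on a well-populated level with `relevel()`. The sketch below uses made-up data in which the “0 beds” level has a single listing; it is an illustration, not the report's data or code.

```r
# Toy data: the "0 beds" level has a single listing, like a sparse reference level.
toy <- data.frame(
  Log_Price = c(4.7, 4.4, 4.6, 4.5, 4.9, 5.0, 5.1, 5.3, 5.4, 5.2),
  Bed = factor(c(0, 1, 1, 1, 2, 2, 2, 3, 3, 3))
)

# Default reference level is the first one ("0"): every comparison inherits
# the uncertainty of that single observation.
fit_default <- lm(Log_Price ~ Bed, data = toy)

# Re-basing on a well-populated level gives more precise contrasts.
toy$Bed <- relevel(toy$Bed, ref = "1")
fit_relevel <- lm(Log_Price ~ Bed, data = toy)

se_default <- summary(fit_default)$coefficients[, "Std. Error"]
se_relevel <- summary(fit_relevel)$coefficients[, "Std. Error"]

se_default["(Intercept)"]   # equals the residual standard error (one observation)
se_default["Bed2"]          # contrast: 2 beds vs 0 beds
se_relevel["Bed2"]          # contrast: 2 beds vs 1 bed (smaller)
```

Changing the reference level does not change the fitted values or the overall F-test; it only changes which comparisons the individual rows of the table report.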
As far as the amenities are concerned, I will not go into a deep analysis of each one. However, I will look at how each of them affects the price, starting with TV.
The plot shows that the mean price for properties without a TV is lower than those that come with a TV.
##
## Two Sample t-test
##
## data: Log_Price by TV
## t = -68.823, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4320314 -0.4081053
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.439361 4.859429
A very low p-value indicates that there is a significant difference in the mean price between properties with a TV and without a TV. Therefore, having a TV generally does increase the price of the property.
Having internet is a huge factor. At least I would not rent a place without internet. I’m surprised that the difference between the two means is not too high just by looking at the plot. Let’s perform an unpaired t-test to verify our results.
##
## Two Sample t-test
##
## data: Log_Price by Internet
## t = -13.676, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3072511 -0.2302221
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.485487 4.754224
There’s actually a good difference between the two means and the p-value is also significantly low which suggests that normally a property without internet has a lower price than a property that includes internet service.
The plot fails to show any significant difference in the mean price between a property with free parking and a property without it. Also, it is evident that most of the listed properties do not provide parking, or at least not for free. This is reasonable, as a lot of people who rent on Airbnb are travellers or visitors who may not need parking. Let’s perform a t-test and find out.
##
## Two Sample t-test
##
## data: Log_Price by Parking
## t = -8.898, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06474298 -0.04136924
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.731333 4.784389
Although the difference between the two means is statistically significant, it is clearly not large enough to conclude that having parking affects the price in any meaningful way.
Kitchen is essentially one of the most important parts of a rental property. It is really surprising that there are actually this many properties without a kitchen in our dataset.
##
## Two Sample t-test
##
## data: Log_Price by Kitchen
## t = -30.798, df = 57004, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3312104 -0.2915756
## sample estimates:
## mean in group FALSE mean in group TRUE
## 4.462852 4.774245
Well, there is no doubt that a property without a kitchen would be listed for less than a property with one. The very low p-value, the difference between the mean prices and the plot all support our assumption.
Now that we are done with the exploratory data analysis, let’s move on towards the final part of this report. In this part, I will introduce Predictive Modelling using machine learning techniques.
Every day, millions of users visit the Airbnb website or browse the app for rental properties. Some of these properties are overpriced, and some are so underpriced that they get booked well in advance, before people who are willing to pay more can decide. This cuts into the revenue of landlords as well as the revenue that Airbnb makes from its 3 percent fee charged to the property owner. As this is the company’s main source of profit and millions of property owners rent out their properties every day, Airbnb loses a large amount of profit just from the shortfall in that 3 percent commission. On the other hand, Airbnb also loses on overpriced properties when nobody is ready to pay for them. To solve this problem of miscommunication between the company and property owners, I decided to design a price prediction model that will assist the company with price automation for future Airbnb property listings. I’m not saying that the company has to automate prices; it just needs to have an algorithm that does it for them. With the help of this model, Airbnb can recommend an automated price based on the features of the property. Eventually, people who are renting out their properties for too little will become aware and raise their price, and those who are overcharging will have to bring their price down to meet market demand.
After applying multiple model selection algorithms and performing exhaustive assumption tests to check for multicollinearity within the model, I decided to include only the most important predictors. These include:
## Significant_Predictors
## 1 Property Type
## 2 Room Type
## 3 Accommodates
## 4 Bathrooms
## 5 City
## 6 Bed Type
## 7 Cancellation Policy
## 8 Cleaning Fee
## 9 Review Scores Rating
## 10 Beds
## 11 TV
## 12 Internet
## 13 Kitchen
The nonsignificant predictors are:
## Nonsignificant_predictors
## 1 Number of Reviews
## 2 Instant Bookable
## 3 Profile Picture
## 4 Bedrooms
## 5 Parking
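The report does not show which multicollinearity test was used, so the following is only a minimal sketch of one common option, the variance inflation factor (VIF), computed in base R. The data frame `toy` and its columns are hypothetical stand-ins, with `Bedrooms` deliberately constructed to track `Beds`; none of it comes from the Airbnb dataset.

```r
set.seed(42)

# Hypothetical toy data (not the Airbnb dataset): Bedrooms is built to follow Beds closely
n <- 500
beds <- sample(1:6, n, replace = TRUE)
toy <- data.frame(
  Beds      = beds,
  Bedrooms  = pmax(1, beds - rbinom(n, 1, 0.3)),
  Bathrooms = sample(1:3, n, replace = TRUE)
)

# VIF for each predictor: regress it on the other predictors, then 1 / (1 - R^2)
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```

A VIF close to 1 suggests a predictor is nearly independent of the others, while a large value (rules of thumb often use 5 or 10) flags a predictor that the others almost fully explain.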
Now that we know which variables to include in the model, I have compiled a data set named ‘final_data’. It contains Log_Price and columns for only the significant predictors. Bathrooms, Beds and Accommodates are treated as categorical variables instead of continuous ones because each of them takes a limited set of discrete values: ‘Bathrooms’ has 17 levels, ‘Beds’ has 18 levels and ‘Accommodates’ has 16 levels. The predictor ‘Bedrooms’ may seem like an important one, but I excluded it because it moves almost in step with the number of beds, and including it alongside ‘Beds’ would introduce multicollinearity into the model. Here is the new dataset:
##   Log_Price Property_Type       Room_Type City Bed_Type
## 1  5.010635     Apartment Entire home/apt  NYC Real Bed
## 2  5.129899     Apartment Entire home/apt  NYC Real Bed
## 3  4.976734     Apartment Entire home/apt  NYC Real Bed
## 4  4.744932     Apartment Entire home/apt   DC Real Bed
## 5  4.442651     Apartment    Private room   SF Real Bed
## 6  4.418841     Apartment Entire home/apt   LA Real Bed
##   Cancellation_Policy Cleaning_Fee Review_Scores    TV Internet Kitchen
## 1              strict         TRUE           100 FALSE     TRUE    TRUE
## 2              strict         TRUE            93 FALSE     TRUE    TRUE
## 3            moderate         TRUE            92  TRUE     TRUE    TRUE
## 4            moderate         TRUE            40  TRUE     TRUE    TRUE
## 5              strict         TRUE           100  TRUE     TRUE   FALSE
## 6            moderate         TRUE            97  TRUE     TRUE    TRUE
##   Accommodates Bathrooms Beds
## 1            3         1    1
## 2            7         1    3
## 3            5         1    3
## 4            2         1    1
## 5            2         1    1
## 6            3         1    1
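Treating Accommodates, Bathrooms and Beds as categorical can be sketched in base R as follows. This is only an illustration: `final_data_demo` is a hypothetical six-row stand-in that reuses the values from the preview above, not the author's actual preparation code.

```r
# Hypothetical stand-in for final_data, using the six preview rows shown above
final_data_demo <- data.frame(
  Log_Price    = c(5.010635, 5.129899, 4.976734, 4.744932, 4.442651, 4.418841),
  Accommodates = c(3, 7, 5, 2, 2, 3),
  Bathrooms    = c(1, 1, 1, 1, 1, 1),
  Beds         = c(1, 3, 3, 1, 1, 1)
)

# Convert the count columns to factors so each distinct value becomes its own level
count_cols <- c("Accommodates", "Bathrooms", "Beds")
final_data_demo[count_cols] <- lapply(final_data_demo[count_cols], factor)

# Inspect the result and the number of levels per column
str(final_data_demo)
sapply(final_data_demo[count_cols], nlevels)
```

One trade-off worth keeping in mind: a level that never appears in the training set (say, an unusually large number of beds) may not be scorable at prediction time, which is not an issue when the column is kept numeric.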
I will now separate the data into training and testing sets, then use the random forest method on the training set to train the model. Once the model is trained, I will apply it to the testing set to predict the dependent variable, which is Log_Price. I will also tune the model using the ‘grid search’ method and adding parameters for better results.
After trying multiple parameters and running a for loop of grid searches with various combinations of trees and model selection metrics, I used the following model:
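The tuning loop itself is not reproduced in this report, so here is a rough sketch of what a grid search over `mtry` and `ntree` with caret might look like. The data frame `toy`, the candidate values and the name `comparison` are all assumptions made for illustration, and the sketch needs the caret and randomForest packages installed.

```r
library(caret)

set.seed(123)
# Hypothetical stand-in for the training data so the sketch runs on its own
toy <- data.frame(
  Log_Price    = rnorm(200, mean = 4.8, sd = 0.6),
  Accommodates = sample(1:6, 200, replace = TRUE),
  Bathrooms    = sample(1:3, 200, replace = TRUE),
  Beds         = sample(1:4, 200, replace = TRUE)
)

# 5-fold cross-validation; mtry candidates are handled by caret's grid
control  <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = 1:3)

# ntree is not a caret tuning parameter for "rf", so loop over it manually
results <- list()
for (n_trees in c(50, 100, 200)) {
  set.seed(123)
  fit <- train(Log_Price ~ ., data = toy, method = "rf", metric = "RMSE",
               tuneGrid = tunegrid, trControl = control, ntree = n_trees)
  results[[as.character(n_trees)]] <-
    cbind(ntree = n_trees, fit$results[, c("mtry", "RMSE", "Rsquared")])
}

# One row per (ntree, mtry) combination; pick the lowest cross-validated RMSE
comparison <- do.call(rbind, results)
best <- comparison[which.min(comparison$RMSE), ]
best
```

Because the toy response is random noise, the selected combination carries no meaning here; the point is only the shape of the loop.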
library(caret)

# ensure results are repeatable
set.seed(123)

# 80/20 split into training and testing sets
inTraining <- createDataPartition(final_data$Log_Price, p = .8, list = FALSE)
training <- final_data[ inTraining, ]
testing  <- final_data[-inTraining, ]

# Manual Grid Search: 5-fold cross-validation with a single candidate value of mtry
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        repeats = 1,
                        search = "grid")
tunegrid <- expand.grid(.mtry = 8)
model <- train(Log_Price ~ .,
               data = training,
               method = "rf",
               metric = "RMSE",
               tuneGrid = tunegrid,
               trControl = control,
               ntree = 200)

# R-squared training
predicted_tr <- predict(model, newdata = training)
actual_tr <- training$Log_Price
rsq_tr <- 1 - sum((actual_tr - predicted_tr)^2) / sum((actual_tr - mean(actual_tr))^2)
print(paste("Training R-squared is:", round(rsq_tr, 2)))
## [1] "Training R-squared is: 0.65"
R-squared is the proportion of the variance in the dependent variable that is predictable from the independent variables. In short, R-squared is a measure of model accuracy and usually lies between 0 and 1. A larger R-squared means that a larger share of the variation in the dependent variable is predictable. But a high value cannot always be taken at face value; R-squared can be overestimated as a result of overfitting, bias or multicollinearity.
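Written out, the quantity that the code computes as `rsq_tr` is the following, where $y_i$ are the actual Log_Price values, $\hat{y}_i$ the model's predictions and $\bar{y}$ the mean of the actual values (the symbols are my notation, not the report's):

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator is the squared error left over after prediction and the denominator is the total squared variation around the mean, so a model that does no better than always predicting the mean scores 0.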
Let’s now measure the model performance by finding R-squared for the testing data:
# predict the outcome of the testing data
predicted <- predict(model, newdata = testing)
actual <- testing$Log_Price
rsq_tst <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
print(paste("Testing R-squared is:", round(rsq_tst, 2)))
## [1] "Testing R-squared is: 0.61"
The training and testing R-squared values are 0.65 and 0.61 respectively, which suggests that the model is consistent and performs almost as well on unseen data as on the data it was trained on. In terms of accuracy, more than 60% of the variation in the dependent variable is explained by the independent variables. Two points put this in context: first, this model is just a prototype of a larger model that Airbnb would use, and second, prices depend not only on the features of the property but also on seasonal variation and nightly demand. Considering this, and that our model was still able to explain more than 60% of the variation in the response variable, I think we did a good job.
Now that we have crafted the algorithm, we can use it for some actual prediction and check how accurate the model is by comparing actual values with the predicted results from the testing dataset. Here’s a table of six random observations from the testing dataset.
##   Actual Data Predicted Data
## 1    4.997212       4.790072
## 2    4.356709       4.396311
## 3    5.293305       5.204282
## 4    4.174387       4.257141
## 5    5.631212       5.257442
## 6    5.105945       5.100946
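Since these values are on the log scale, it can help to convert them back to prices before judging the errors. The sketch below does that for the six rows in the table, under the assumption that Log_Price is the natural logarithm of the nightly price; the column names are mine.

```r
# The six actual/predicted pairs from the table above (log scale)
comparison <- data.frame(
  actual_log    = c(4.997212, 4.356709, 5.293305, 4.174387, 5.631212, 5.105945),
  predicted_log = c(4.790072, 4.396311, 5.204282, 4.257141, 5.257442, 5.100946)
)

# Assumption: Log_Price is the natural log of the nightly price, so exp() undoes it
comparison$actual_price    <- exp(comparison$actual_log)
comparison$predicted_price <- exp(comparison$predicted_log)

# Absolute error in price units for each listing
comparison$abs_error <- abs(comparison$actual_price - comparison$predicted_price)

round(comparison, 2)
```

A fixed gap on the log scale corresponds to a fixed percentage gap in price, so the same log error costs more in absolute terms on an expensive listing than on a cheap one.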
Data Wrangling
Cleaned up the data, removed missing values, filtered the data, and reorganized.
Exploratory Data Analysis
Examined plots and completed assumption tests and regression analysis to explore the effects of potential predictors.
Machine Learning
Built a model and trained it on data, using that for prediction analysis. Examined model accuracy.